Applying Math

Stories from the trenches of science

Improving a Clustering Algorithm of Rodriguez and Laio


A recent Science paper by Rodriguez and Laio proposed a simple clustering method built on two heuristics: the local density $\rho_j$ of each point and the minimum distance $\delta_j$ from that point to any point of higher density. They then identify cluster centers as the outliers that have simultaneously high $\rho_j$ and $\delta_j$. Their definitions are remarkably simple. The local density is given by $$ \rho_j = \sum_k H( d_c - d_{j,k} ) $$ where $H$ is the Heaviside function (one if the argument is greater than zero, zero otherwise), $d_{j,k}$ is the pairwise distance between points $j$ and $k$, and $d_c$ is a cutoff distance. The distance to a point of higher density is simply $$ \delta_j = \min_{k:\, \rho_k > \rho_j} d_{j,k}.$$
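Before implementing anything, it helps to evaluate both quantities by hand. The following sketch (not from the paper) computes $\rho_j$ and $\delta_j$ for five points on a line via the full pairwise distance matrix, counting each point's own zero distance just as the Heaviside definition does:

```python
import numpy as np

# Five 1-D points: a triple near 0 and a pair near 1, with cutoff d_c = 0.15.
pts = np.array([0.0, 0.1, 0.2, 1.0, 1.1])
d = np.abs(pts[:, None] - pts[None, :])   # pairwise distances d_{j,k}
d_c = 0.15

# rho_j = sum_k H(d_c - d_{j,k}); the self-distance always contributes 1.
rho = np.sum(d < d_c, axis=1)

# delta_j: distance to the nearest point of strictly higher density;
# the global density maximum gets the largest distance instead.
delta = np.empty(len(pts))
for j in range(len(pts)):
    higher = rho > rho[j]
    delta[j] = d[j, higher].min() if higher.any() else d[j].max()

# rho   -> [2, 3, 2, 2, 2]: only the middle of the triple has two neighbours
# delta -> [0.1, 1.0, 0.1, 0.9, 1.0]: the density maximum (index 1) stands out
```

Note how the density maximum is the only point with both a large $\rho$ and a large $\delta$, which is exactly what the decision plot exploits.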

In this memo, we try their approach on several different datasets. Although they claim their approach is not sensitive to the cutoff $d_c$, every case tried here depends sensitively on $d_c$, producing duplicate cluster centers unless $d_c$ is chosen carefully. Using a smoother density kernel, such as the $\chi^2_p$ density with $p$ degrees of freedom, we can produce more robust cluster centers with fewer duplicates.

Implementation

Implementing these functions in Python is rather straightforward. Rather than explicitly forming the pairwise distance matrix, we provide a metric (sklearn's `euclidean_distances` is the prototype) and generate the distances on the fly. The following function computes the vector $\rho_j$.
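Concretely, the metric is any callable with the signature `metric(x, data)`, returning the distances from one point `x` to every row of `data`. A minimal NumPy stand-in (a hypothetical helper named `euclid`, shown only to illustrate the interface) looks like this:

```python
import numpy as np

def euclid(x, data):
    """Distances from a single point x to every row of data; shape (n,)."""
    return np.sqrt(((data - x) ** 2).sum(axis=1))

data = np.array([[0.0, 0.0], [0.3, 0.4], [1.0, 0.0]])
dist = euclid(data[0], data)   # -> [0.0, 0.5, 1.0]
```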

In [2]:
import numpy as np

def local_density(metric, data, cutoff):
    """Heaviside local density rho from RL14.  This is not an efficient implementation!"""
    n = data.shape[0]
    rho = np.zeros(n)
    for j in range(n):
        x = data[j]
        # distances from point j to every point, including itself
        dist = metric(x, data).reshape(n)
        rho[j] += np.sum(dist < cutoff)
    return rho

Similarly we can compute the vector of $\delta_j$ using $\rho_j$.

In [3]:
def higher_density_distance(local_density, metric, data):
    """Minimum distance to a point of higher density from RL14."""
    n = local_density.shape[0]
    delta = np.zeros(n)
    for j in range(n):
        # np.where returns a tuple of index arrays; take the first element
        greater = np.where(local_density > local_density[j])[0]
        x = data[j]
        if len(greater) == 0:
            # Highest-density point: use the largest distance instead
            dist = metric(x, data)
            delta[j] = np.max(dist)
        else:
            dist = metric(x, data[greater])
            delta[j] = np.min(dist)
    return delta

We will want a tool to explore the shape of their key plot of $\rho_j$ vs $\delta_j$ with dependence upon $d_c$, the cutoff distance. Here it is:

In [177]:
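The cell defining these helpers was collapsed in the original notebook. What follows is a speculative reconstruction, inferred from how `generate_rho_delta` and `sweep_cutoff` are called in later cells; the real code presumably reuses `local_density` and `higher_density_distance` above, where this self-contained sketch recomputes the same quantities inline.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_rho_delta(rho, delta, labels=None):
    """Scatter the decision graph: rho_j against delta_j, coloured by label."""
    fig, ax = plt.subplots()
    if labels is None:
        ax.plot(rho, delta, 'o')
    else:
        for k in np.unique(labels):
            I = np.where(labels == k)[0]
            ax.plot(rho[I], delta[I], 'o')
    ax.set_xlabel(r'$\rho$')
    ax.set_ylabel(r'$\delta$')
    return fig

def generate_rho_delta(metric, data, cutoff, labels=None):
    """Decision graph for one value of the cutoff d_c."""
    n = data.shape[0]
    dist = np.array([metric(data[j], data).reshape(n) for j in range(n)])
    rho = (dist < cutoff).sum(axis=1)      # Heaviside density, self included
    delta = np.empty(n)
    for j in range(n):
        higher = rho > rho[j]
        delta[j] = dist[j, higher].min() if higher.any() else dist[j].max()
    fig = plot_rho_delta(rho, delta, labels)
    fig.axes[0].set_title('cutoff = {}'.format(cutoff))
    return fig

# `sweep_cutoff`, used in later cells, appears to be the same helper
# under another name.
sweep_cutoff = generate_rho_delta
```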

Two Clusters and Noise

This first example mimics the first example that appears in their paper: two well-separated clusters plus noise. They do not mention from which distributions they draw the clusters and noise, so here we simply generate two tight uniformly distributed clusters and add a few hand-placed stray points to serve as noise.

In [148]:
n = [20,5]
np.random.seed(0)
# Centers
center = [np.array([0.2,0.5]), np.array([0.7,0.4])]
sigma = [0.2, 0.1, 0.7]
# Generate data
j = 0
data = np.empty([0,2])
labels = np.empty([0], dtype=np.int16)
for n_, c, s in zip(n, center, sigma):
    data = np.concatenate((data, c + s*np.random.uniform(0, 1, size=(n_,2))))
    labels = np.concatenate((labels, j*np.ones(n_, dtype=np.int16)))
    j += 1
# stray points
for pt in [ np.array([0.2,0.2]), np.array([0.8,0.8]), np.array([0.6,0.1])]:
    data = np.concatenate((data, pt.reshape((1,2))))
    labels = np.concatenate((labels, np.array([2], dtype=np.int16)))

# Plot the points
for k in range(np.max(labels)+1):
    I = np.where(labels == k)[0]
    plt.plot(data[I,0],data[I,1],'o')
plt.xlim(0,1)
plt.ylim(0,1)
Out[148]:
(0, 1)
In [149]:
StaticInteract(lambda cutoff: generate_rho_delta(euclidean_distances, data, cutoff, labels = labels), cutoff=RangeWidget(0,0.1,0.005))
Out[149]:
[interactive figure: $\rho$–$\delta$ decision graph, with a slider for the cutoff $d_c$]

In the above example, we have a hard time replicating the excellent performance of their Figure 1B. In that figure, the local density appears to take on real values, rather than being restricted to the integers that their definition of $\rho_j$ produces; this may be a visualization artifact that lets the many low-lying points be distinguished. More importantly, no value of $d_c$ seems to isolate a single representative of each class of point: duplicate centers from the same class appear instead. In the examples they present, points are either distributed evenly (not uniformly at random) or strongly Gaussian.

Alternative Distance Metric

The choice of the Heaviside function in the definition of $\rho_j$ was surely made for simplicity, and we are free to try another kernel. For example, we can use the density of a chi-squared distribution with $p$ degrees of freedom, defining $$ \rho_j = \sum_{k \neq j} \chi_p^2\!\left( \frac{d_{j,k}^2}{\sigma^2} \right), $$ where $\chi_p^2(\cdot)$ denotes the $\chi^2$ probability density and $\sigma$ sets the length scale; the hard cutoff $d_c$ is replaced by the smooth decay of the kernel.
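To see why this helps, compare the two kernels directly (a small illustration, not from the notebook): the Heaviside contribution drops from one to zero exactly at $d_c$, while the $\chi^2$ density decays continuously, so a small perturbation of the data moves $\rho_j$ smoothly rather than in unit jumps.

```python
import numpy as np
from scipy.stats import chi2

d = np.linspace(0.0, 3.0, 7)          # sample distances
step = (d < 1.0).astype(float)        # Heaviside kernel with d_c = 1
smooth = chi2(3).pdf(d ** 2)          # chi^2_3 density of d^2 (sigma = 1)

# step jumps abruptly: [1, 1, 0, 0, 0, 0, 0]
# smooth decays gradually past its mode and never jumps
```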

In [150]:
from scipy.stats import chi2

def chi_density(metric, data, sigma, df):
    """Smoothed local density: a chi-squared kernel in place of the Heaviside step.  This is not an efficient implementation!"""
    n = data.shape[0]
    rho = np.zeros(n)
    for j in range(n):
        x = data[j]
        dist = metric(x, data).reshape(n)
        # skip the self-distance: chi2.pdf(0) diverges when df = 1
        rho[j] += np.sum(chi2(df).pdf(dist[0:j]**2/sigma**2))
        rho[j] += np.sum(chi2(df).pdf(dist[j+1:]**2/sigma**2))
    return rho
In [151]:
def generate_rho_delta_chi(metric, data, sigma, df, labels = None):
    """ Generate a rho/delta plot for a given kernel width sigma
    and degrees of freedom df.
    """
    rho = chi_density(metric, data, sigma, df)
    delta = higher_density_distance(rho, metric, data)
    fig = plot_rho_delta(rho, delta, labels)
    ax = fig.axes[0]
    ax.set_title('sigma = {}; df={}'.format(sigma, df))
    return fig
In [163]:
StaticInteract(lambda **kw: generate_rho_delta_chi(euclidean_distances, data, kw['sigma'], kw['df'], labels = labels),
               sigma=RangeWidget(1,5,1), df = RadioWidget(range(1,5)))
Out[163]:
[interactive figure: $\rho$–$\delta$ decision graph, with widgets for sigma (1–5) and df (1–4)]

This $\chi^2$ density kernel seems much more robust at separating the two cluster centers.
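One way to make "simultaneously high $\rho_j$ and $\delta_j$" operational is to rank points by the product $\gamma_j = \rho_j \delta_j$ and take the top $k$ as centers. This is a sketch of that selection step on made-up values, not code from the notebook:

```python
import numpy as np

# Made-up decision-graph values for five points
rho = np.array([5.0, 9.0, 4.0, 8.0, 1.0])
delta = np.array([0.1, 1.2, 0.1, 0.9, 0.3])

gamma = rho * delta                      # large only when both factors are large
centers = np.argsort(gamma)[::-1][:2]    # indices of the top-2 candidates
# -> points 1 and 3 are selected as centers
```

Under this ranking, a duplicate center shows up as two points of the same class landing in the top $k$, which is exactly the failure mode the smoother kernel suppresses.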

Gaussian Clusters Example

This approach seems to work even better for clusters whose density peaks at a well-defined center, so we next try genuinely Gaussian clusters.

In [155]:
# Centers
center = [np.array([0.2,0.5]), np.array([0.7,0.4])]
sigma = []
sigma.append(0.05*np.array([[1, 0], [0, 1]]))
sigma.append(0.05*np.array([[1, 0.5], [0.5, 1]]))
n = [400, 100]
data = np.empty([0,2])
labels = np.empty([0], dtype=np.int16)
j = 0
np.random.seed(0)
for n_, c, s in zip(n, center, sigma):
    new_points = np.zeros([n_,2])
    for k in range(n_):
        new_points[k,:] = c + np.dot(s,np.random.normal(0, 1, size = (2,1))).reshape(2)
    data = np.concatenate((data,new_points))
    labels = np.concatenate((labels, j*np.ones(n_, dtype=np.int16)))
    j += 1

# Plot the points
for k in range(np.max(labels)+1):
    I = np.where(labels == k)[0]
    plt.plot(data[I,0],data[I,1],'o')
plt.xlim(0,1)
plt.ylim(0,1)
Out[155]:
(0, 1)
In [156]:
StaticInteract(lambda cutoff: sweep_cutoff(euclidean_distances, data, cutoff, labels = labels), cutoff=RangeWidget(0,0.1,0.005))
Out[156]:
[interactive figure: $\rho$–$\delta$ decision graph, with a slider for the cutoff $d_c$]

Again, we see that the Heaviside-based density has difficulty separating the two clusters uniquely; oftentimes we get duplicate cluster centers. However, if we use the $\chi^2$-based density, duplicates do not appear.

In [162]:
StaticInteract(lambda **kw: generate_rho_delta_chi(euclidean_distances, data, kw['sigma'], kw['df'], labels = labels),
               sigma=RangeWidget(1,20,1), df = RadioWidget(range(1,6)))
Out[162]:
[interactive figure: $\rho$–$\delta$ decision graph, with widgets for sigma (1–20) and df (1–5)]

NIST Handwriting Example

To test this approach on something harder, we next consider the standard handwritten digits dataset (derived from NIST data) that is included with SciKit-Learn.

In [164]:
from sklearn import datasets
digits = datasets.load_digits()
n = digits['data'].shape[0]

Here is an example of what these images look like (taken from Joshua Bloom's notes).

In [165]:
plt.figure(figsize=(16, 6))
for i in range(10):
    plt.subplot(1, 10, i + 1)   # subplot indices are 1-based
    plt.imshow(digits['images'][i+10], cmap='gray', interpolation='nearest')
    plt.grid(visible=False)

To pick a range of distance cutoffs, we look at a typical entry's distances to the other points.

In [134]:
dist = euclidean_distances(digits['data'][0:1], digits['data']).reshape(n)
plt.hist(dist, 50)
plt.title('Histogram of Euclidean Distances');
In [143]:
StaticInteract(lambda cutoff: sweep_cutoff(euclidean_distances, digits['data'], cutoff, labels = digits['target']), 
               cutoff=RangeWidget(10,70,2))
Out[143]:
[interactive figure: $\rho$–$\delta$ decision graph, with a slider for the cutoff (10–70)]
In [176]:
StaticInteract(lambda **kw: 
               generate_rho_delta_chi(euclidean_distances, digits['data'], 60, kw['df'], labels = digits['target']),
               df = RangeWidget(1,64,1))
Out[176]:
[interactive figure: $\rho$–$\delta$ decision graph at sigma = 60, with a slider for df (1–64)]